Wisconsin Breast Cancer Binary Classification

  • Goal: Predict whether a breast tissue sample is benign or malignant. The best published results appear to be around 96-97% accuracy.
  • Dataset: BreastCancer from the mlbench package. Each record in the dataset represents one tissue sample; 699 observations
  • Predictors: 9 cytological features (plus a sample Id, which is dropped before modeling)
  • Target: Class: benign or malignant

Results

~ 99% accuracy (98.6% on the held-out validation set)


In [8]:
source("C:\\Work\\myRfunctions.R")
fnRunDate()
fnInstallPackages()


'Project last run on Tue Sep 19 4:55:46 PM 2017'
'Package install completed'

In [9]:
# Load data
data(BreastCancer)
# rename the dataset
dataset <- as_tibble(BreastCancer)

In [10]:
glimpse(dataset)


Observations: 699
Variables: 11
$ Id              <chr> "1000025", "1002945", "1015425", "1016277", "101702...
$ Cl.thickness    <ord> 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, 8, 7, 4, ...
$ Cell.size       <ord> 1, 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, 1, 7, 4, 1,...
$ Cell.shape      <ord> 1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1, 5, 6, 1,...
$ Marg.adhesion   <ord> 1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, 10, 4, 1,...
$ Epith.c.size    <ord> 2, 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2, 7, 6, 2, ...
$ Bare.nuclei     <fctr> 1, 10, 2, 4, 1, 10, 10, 1, 1, 1, 1, 1, 3, 3, 9, 1,...
$ Bl.cromatin     <fctr> 3, 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3, 5, 4, 2,...
$ Normal.nucleoli <fctr> 1, 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, 5, 3, 1,...
$ Mitoses         <fctr> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 4, 1, 1,...
$ Class           <fctr> benign, benign, benign, benign, benign, malignant,...

Apart from the Id, the attributes are factors (most of them ordered). For modeling, it will be more convenient to work with the data as numbers rather than factors.


In [11]:
# Convert factor columns to numeric. as.numeric() returns the level index,
# which matches the 1-10 label here except where a level is absent
# (Mitoses has no "9", so its "10" maps to 9; hence max = 9 below).
dataset[, 1:10] <- dplyr::mutate_if(dplyr::select(dataset, 1:10), is.factor, as.numeric)

str(dataset)


Classes 'tbl_df', 'tbl' and 'data.frame':	699 obs. of  11 variables:
 $ Id             : chr  "1000025" "1002945" "1015425" "1016277" ...
 $ Cl.thickness   : num  5 5 3 6 4 8 1 2 2 4 ...
 $ Cell.size      : num  1 4 1 8 1 10 1 1 1 2 ...
 $ Cell.shape     : num  1 4 1 8 1 10 1 2 1 1 ...
 $ Marg.adhesion  : num  1 5 1 1 3 8 1 1 1 1 ...
 $ Epith.c.size   : num  2 7 2 3 2 7 2 2 2 2 ...
 $ Bare.nuclei    : num  1 10 2 4 1 10 10 1 1 1 ...
 $ Bl.cromatin    : num  3 3 3 3 3 9 3 3 1 2 ...
 $ Normal.nucleoli: num  1 2 1 7 1 7 1 1 1 1 ...
 $ Mitoses        : num  1 1 1 1 1 1 1 1 5 1 ...
 $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

In [12]:
# Id can't be used for prediction
dataset$Id <- NULL

In [13]:
psych::describe(dataset, check = T)


                vars   n     mean        sd median  trimmed    mad min max range      skew    kurtosis         se
Cl.thickness       1 699 4.417740 2.8157407      4 4.151515 2.9652   1  10     9 0.5903165 -0.63461674 0.10650109
Cell.size          2 699 3.134478 3.0514591      1 2.556150 0.0000   1  10     9 1.2278492  0.08069260 0.11541678
Cell.shape         3 699 3.207439 2.9719128      1 2.670232 0.0000   1  10     9 1.1568774 -0.01018344 0.11240806
Marg.adhesion      4 699 2.806867 2.8553792      1 2.185383 0.0000   1  10     9 1.5179315  0.96096532 0.10800036
Epith.c.size       5 699 3.216023 2.2142999      2 2.778966 0.0000   1  10     9 1.7048304  2.13030001 0.08375251
Bare.nuclei        6 683 3.544656 3.6438572      1 3.058501 0.0000   1  10     9 0.9856714 -0.80821223 0.13942832
Bl.cromatin        7 699 3.437768 2.4383643      3 3.103387 1.4826   1  10     9 1.0952527  0.16565477 0.09222741
Normal.nucleoli    8 699 2.866953 3.0536339      1 2.233512 0.0000   1  10     9 1.4161630  0.45241217 0.11549904
Mitoses            9 699 1.569385 1.6198026      1 1.119430 0.0000   1   9     8 3.3596705 10.88087045 0.06126656
Class*            10 699 1.344778 0.4756363      1 1.306595 0.0000   1   2     1 0.6517572 -1.57746315 0.01799022

There are 16 NA values for Bare.nuclei (683 of the 699 values are present). We may need to remove or impute those for some analysis and modeling.
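As a quick check, here is the NA count and one imputation option (a minimal sketch; caret's "medianImpute" is the same mechanism passed to preProc in the train() calls later):

# count missing values per column
colSums(is.na(dataset))
# estimate a median imputation on the predictors, then apply it
pp <- caret::preProcess(dataset[, 1:9], method = "medianImpute")
sum(is.na(predict(pp, dataset[, 1:9])))  # 0 after imputation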

All attributes have integer values in the range [1,10], so we may not see much benefit from normalizing attributes for instance-based methods like KNN.

There is some imbalance in the Class values.


In [14]:
# summarize the class distribution
fnClassDistribution(Class = dataset$Class)


           freq percentage
benign      458   65.52217
malignant   241   34.47783

Let’s look at the correlation between the attributes. We have to exclude the 16 rows with NA values (incomplete cases) when calculating the correlations.
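For reference, the underlying correlation matrix can be computed directly; a minimal sketch using base R, where use = "complete.obs" drops the incomplete cases:

# correlations on complete cases only
cor_mat <- cor(dplyr::select_if(dataset, is.numeric), use = "complete.obs")
round(cor_mat, 2)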


In [15]:
# summarize correlations between input variables
options(warn=-1)  # silence warnings from chart.Correlation
PerformanceAnalytics::chart.Correlation(dplyr::select_if(dataset, is.numeric), histogram=TRUE, pch=".")


I see modest to high correlation between some of the attributes, for example cell shape and cell size at 0.91. Some algorithms may benefit from removing the highly correlated attributes.
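If we did want to prune them, caret's findCorrelation() flags columns whose pairwise correlation exceeds a cutoff; a sketch (not applied here):

# flag predictors correlated above 0.90 (cor_mat from the sketch above)
high_cor <- caret::findCorrelation(cor_mat, cutoff = 0.90)
colnames(cor_mat)[high_cor]  # candidates for removal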

Almost all of the distributions have an exponential or bimodal shape to them. We may benefit from log transforms or other power transforms later on.
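caret can estimate such transforms on its own as well; a minimal sketch of a standalone Box-Cox fit (the same "BoxCox" option is passed to preProc inside train() later, and all values are >= 1, so the transform is well defined):

# estimate per-column Box-Cox transforms and apply them
pp_bc <- caret::preProcess(dataset[, 1:9], method = "BoxCox")
dataset_bc <- predict(pp_bc, dataset[, 1:9])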


In [191]:
# scatterplot matrix
#trellis.par.set(theme = col.whitebg(), warn = FALSE)
caret::featurePlot(x=dataset[, 1:5], y=dataset$Class, plot="ellipse")


The green (benign) points appear to cluster around the bottom-left corner (smaller values), while the red (malignant) points are spread throughout.


In [192]:
# density plots for each attribute by class value
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=dataset[, 1:9], y=dataset$Class, plot="density", scales=scales)



In [193]:
# scatterplot matrix
caret::featurePlot(x=dataset[, 1:9], y=dataset$Class, plot="box")



In [194]:
caret::featurePlot(x=dataset[, 1:9], y=dataset$Class, plot="strip", jitter = TRUE)



In [195]:
str(dataset)


Classes 'tbl_df', 'tbl' and 'data.frame':	699 obs. of  10 variables:
 $ Cl.thickness   : num  5 5 3 6 4 8 1 2 2 4 ...
 $ Cell.size      : num  1 4 1 8 1 10 1 1 1 2 ...
 $ Cell.shape     : num  1 4 1 8 1 10 1 2 1 1 ...
 $ Marg.adhesion  : num  1 5 1 1 3 8 1 1 1 1 ...
 $ Epith.c.size   : num  2 7 2 3 2 7 2 2 2 2 ...
 $ Bare.nuclei    : num  1 10 2 4 1 10 10 1 1 1 ...
 $ Bl.cromatin    : num  3 3 3 3 3 9 3 3 1 2 ...
 $ Normal.nucleoli: num  1 2 1 7 1 7 1 1 1 1 ...
 $ Mitoses        : num  1 1 1 1 1 1 1 1 5 1 ...
 $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

In [196]:
# Split out validation dataset
# create a list of 80% of the rows in the original dataset we can use for training
set.seed(7)
validation_index <- createDataPartition(dataset$Class, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- dataset[-validation_index,]
# use the remaining 80% of data for training and testing the models
dataset <- dataset[validation_index,]

In [197]:
formula <- Class ~ .

In [201]:
# Evaluate Algorithms

# 10-fold cross validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"

# LG
set.seed(7)
fit.glm <- train(formula, data=dataset, method="glm", metric=metric, trControl=control, na.action=na.pass)
# LDA
set.seed(7)
fit.lda <- train(formula, data=dataset, method="lda", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# GLMNET
set.seed(7)
fit.glmnet <- train(formula, data=dataset, method="glmnet", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# KNN
set.seed(7)
fit.knn <- train(formula, data=dataset, method="knn", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# CART
set.seed(7)
fit.cart <- train(formula, data=dataset, method="rpart", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# Naive Bayes
set.seed(7)
fit.nb <- train(formula, data=dataset, method="nb", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# SVM
set.seed(7)
fit.svm <- train(formula, data=dataset, method="svmRadial", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# Compare algorithms
results <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet, KNN=fit.knn, CART=fit.cart, NB=fit.nb, SVM=fit.svm))
summary(results)
dotplot(results)


Call:
summary.resamples(object = results)

Models: LG, LDA, GLMNET, KNN, CART, NB, SVM 
Number of resamples: 30 

Accuracy 
         Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
LG     0.9091  0.9617 0.9636 0.9624  0.9815 1.0000    0
LDA    0.8909  0.9457 0.9640 0.9559  0.9649 1.0000    0
GLMNET 0.8929  0.9467 0.9646 0.9631  0.9821 1.0000    0
KNN    0.8929  0.9507 0.9646 0.9667  0.9821 1.0000    0
CART   0.8571  0.9091 0.9286 0.9267  0.9474 0.9821    0
NB     0.8750  0.9340 0.9636 0.9554  0.9821 1.0000    0
SVM    0.8750  0.9286 0.9474 0.9494  0.9643 1.0000    0

Kappa 
         Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
LG     0.8014  0.9159 0.9196 0.9174  0.9577 1.0000    0
LDA    0.7462  0.8763 0.9182 0.9010  0.9214 1.0000    0
GLMNET 0.7610  0.8823 0.9223 0.9181  0.9597 1.0000    0
KNN    0.7610  0.8933 0.9221 0.9266  0.9607 1.0000    0
CART   0.6730  0.7969 0.8419 0.8380  0.8831 0.9607    0
NB     0.7380  0.8568 0.9200 0.9039  0.9607 1.0000    0
SVM    0.7380  0.8474 0.8883 0.8918  0.9235 1.0000    0

Good accuracy across the board. All algorithms have a mean accuracy above 90%, well above the baseline of 65% if we just predicted benign. The problem is learnable.
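For reference, that 65% baseline is just the majority-class proportion in the training split:

# no-information rate: accuracy from always predicting "benign"
max(prop.table(table(dataset$Class)))  # ~0.655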

Some predictors have skewed distributions. I'll try normalizing them with a Box-Cox transform.


In [202]:
# Evaluate Algorithms Transform

# 10-fold cross validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
# LG
set.seed(7)
fit.glm <- train(formula, data=dataset, method="glm", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# LDA
set.seed(7)
fit.lda <- train(formula, data=dataset, method="lda", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# GLMNET
set.seed(7)
fit.glmnet <- train(formula, data=dataset, method="glmnet", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# KNN
set.seed(7)
fit.knn <- train(formula, data=dataset, method="knn", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# CART
set.seed(7)
fit.cart <- train(formula, data=dataset, method="rpart", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# Naive Bayes
set.seed(7)
fit.nb <- train(formula, data=dataset, method="nb", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# SVM
set.seed(7)
fit.svm <- train(formula, data=dataset, method="svmRadial", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# Compare algorithms
transform_results <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet, KNN=fit.knn, CART=fit.cart, NB=fit.nb, SVM=fit.svm))
summary(transform_results)
dotplot(transform_results)


Call:
summary.resamples(object = transform_results)

Models: LG, LDA, GLMNET, KNN, CART, NB, SVM 
Number of resamples: 30 

Accuracy 
         Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
LG     0.9107  0.9467 0.9643 0.9620  0.9821 1.0000    0
LDA    0.9107  0.9467 0.9643 0.9649  0.9821 1.0000    0
GLMNET 0.9107  0.9507 0.9643 0.9649  0.9821 1.0000    0
KNN    0.9107  0.9636 0.9649 0.9708  0.9824 1.0000    0
CART   0.8571  0.9091 0.9286 0.9267  0.9474 0.9821    0
NB     0.9286  0.9467 0.9734 0.9661  0.9821 1.0000    0
SVM    0.9286  0.9514 0.9734 0.9696  0.9824 1.0000    0

Kappa 
         Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
LG     0.8082  0.8861 0.9219 0.9163  0.9597 1.0000    0
LDA    0.8082  0.8858 0.9219 0.9240  0.9607 1.0000    0
GLMNET 0.8082  0.8952 0.9219 0.9231  0.9601 1.0000    0
KNN    0.8082  0.9203 0.9247 0.9369  0.9616 1.0000    0
CART   0.6730  0.7969 0.8419 0.8380  0.8831 0.9607    0
NB     0.8447  0.8842 0.9420 0.9263  0.9607 1.0000    0
SVM    0.8484  0.8961 0.9422 0.9346  0.9616 1.0000    0

The Box-Cox transform gave a clear improvement: SVM's mean accuracy rose from about 94.9% to 97.0%, and LDA, KNN, and NB improved as well.

KNN and SVM were the strongest performers, so I'll tune both, starting with SVM.

Tuning SVM

The SVM implementation has two parameters that we can tune with the caret package: sigma, which is a smoothing term, and C, which is a cost constraint. You can learn more about these parameters in the help for the ksvm() function (?ksvm). Let’s try a range of values for C between 1 and 10 and a few small values for sigma around the default of 0.1.


In [203]:
# Tune SVM

# 10-fold cross validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
set.seed(7)
grid <- expand.grid(.sigma=c(0.025, 0.05, 0.1, 0.15), .C=seq(1, 10, by=1))
fit.svm <- train(formula, data=dataset, method="svmRadial", metric=metric, tuneGrid=grid, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
print(fit.svm)
plot(fit.svm)


Support Vector Machines with Radial Basis Function Kernel 

547 samples
  9 predictor
  2 classes: 'benign', 'malignant' 

Pre-processing: Box-Cox transformation (9), median imputation (9) 
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 503, 504, 505, 503, 505, 504, ... 
Resampling results across tuning parameters:

  sigma  C   Accuracy   Kappa    
  0.025   1  0.9654821  0.9254749
  0.025   2  0.9672573  0.9292142
  0.025   3  0.9642916  0.9226964
  0.025   4  0.9619211  0.9175831
  0.025   5  0.9625163  0.9188285
  0.025   6  0.9636964  0.9214701
  0.025   7  0.9642916  0.9227474
  0.025   8  0.9642916  0.9226863
  0.025   9  0.9636964  0.9213061
  0.025  10  0.9636964  0.9212328
  0.050   1  0.9672573  0.9291806
  0.050   2  0.9660773  0.9266555
  0.050   3  0.9666726  0.9279668
  0.050   4  0.9666834  0.9279778
  0.050   5  0.9660881  0.9266354
  0.050   6  0.9660881  0.9266354
  0.050   7  0.9660881  0.9266354
  0.050   8  0.9660881  0.9266354
  0.050   9  0.9648868  0.9239636
  0.050  10  0.9654821  0.9253438
  0.100   1  0.9684482  0.9318342
  0.100   2  0.9678739  0.9307310
  0.100   3  0.9654929  0.9253581
  0.100   4  0.9642808  0.9228357
  0.100   5  0.9631007  0.9203068
  0.100   6  0.9619103  0.9176851
  0.100   7  0.9601025  0.9134931
  0.100   8  0.9576999  0.9081262
  0.100   9  0.9559030  0.9040844
  0.100  10  0.9559030  0.9040844
  0.150   1  0.9690539  0.9333390
  0.150   2  0.9660881  0.9267030
  0.150   3  0.9631007  0.9202715
  0.150   4  0.9606981  0.9148318
  0.150   5  0.9565095  0.9054406
  0.150   6  0.9559138  0.9040976
  0.150   7  0.9547125  0.9013287
  0.150   8  0.9541277  0.9000002
  0.150   9  0.9547230  0.9013765
  0.150  10  0.9541277  0.9000316

Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.15 and C = 1.

We can see that tuning made very little difference to the results. The most accurate configuration scored about 96.9% with sigma = 0.15 and C = 1, barely ahead of sigma = 0.1 and C = 1. I could tune further, but I don’t expect a payoff.
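For reference, the winning row can be pulled straight from the fit object; a sketch using caret's resampled results table:

# best tuning configuration by mean resampled accuracy
fit.svm$results[which.max(fit.svm$results$Accuracy), ]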

Tuning KNN

The KNN algorithm has one parameter that we can tune with caret: k, the number of closest instances to collect in order to make a prediction. Let’s try all k values between 1 and 20.


In [204]:
# Tune kNN

# 10-fold cross validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
set.seed(7)
grid <- expand.grid(.k=seq(1,20,by=1))
fit.knn <- train(formula, data=dataset, method="knn", metric=metric, tuneGrid=grid, preProc=c("BoxCox","knnImpute"), trControl=control, na.action=na.pass)
print(fit.knn)
plot(fit.knn)


k-Nearest Neighbors 

547 samples
  9 predictor
  2 classes: 'benign', 'malignant' 

Pre-processing: Box-Cox transformation (9), nearest neighbor imputation
 (9), centered (9), scaled (9) 
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 503, 504, 505, 503, 505, 504, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   1  0.9458056  0.8798375
   2  0.9523958  0.8948365
   3  0.9648544  0.9236234
   4  0.9672357  0.9287734
   5  0.9702335  0.9355476
   6  0.9648551  0.9238118
   7  0.9636964  0.9212404
   8  0.9643132  0.9228175
   9  0.9654716  0.9249342
  10  0.9666517  0.9278189
  11  0.9666729  0.9278953
  12  0.9672473  0.9290644
  13  0.9678425  0.9304068
  14  0.9684486  0.9316984
  15  0.9690546  0.9330217
  16  0.9690334  0.9329993
  17  0.9678638  0.9304022
  18  0.9672686  0.9290578
  19  0.9672794  0.9290070
  20  0.9672794  0.9290070

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was k = 5.

Tuning made a little difference for KNN: k = 5 was best, with a mean accuracy of about 97.0%.
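The chosen setting is also stored on the fit object:

fit.knn$bestTune  # k = 5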

Ensemble Methods

As a final check, I'll look at some boosting and bagging ensemble algorithms on the dataset.


In [205]:
# 10-fold cross validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
# Bagged CART
set.seed(7)
fit.treebag <- train(formula, data=dataset, method="treebag", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# Random Forest
set.seed(7)
fit.rf <- train(formula, data=dataset, method="rf", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# Stochastic Gradient Boosting
set.seed(7)
fit.gbm <- train(formula, data=dataset, method="gbm", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, verbose=FALSE, na.action=na.pass)
# C5.0
set.seed(7)
fit.c50 <- train(formula, data=dataset, method="C5.0", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# Compare results
ensemble_results <- resamples(list(BAG=fit.treebag, RF=fit.rf, GBM=fit.gbm, C50=fit.c50))
summary(ensemble_results)
dotplot(ensemble_results)


Call:
summary.resamples(object = ensemble_results)

Models: BAG, RF, GBM, C50 
Number of resamples: 30 

Accuracy 
      Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
BAG 0.8750  0.9464 0.9636 0.9541  0.9649    1    0
RF  0.9286  0.9636 0.9643 0.9679  0.9821    1    0
GBM 0.9107  0.9464 0.9643 0.9613  0.9821    1    0
C50 0.8750  0.9457 0.9636 0.9559  0.9649    1    0

Kappa 
      Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
BAG 0.7380  0.8794 0.9182 0.8992  0.9230    1    0
RF  0.8484  0.9186 0.9219 0.9299  0.9601    1    0
GBM 0.8082  0.8833 0.9189 0.9148  0.9597    1    0
C50 0.7315  0.8812 0.9189 0.9033  0.9230    1    0

Random Forest was the strongest ensemble (mean accuracy ~96.8%), but it does not beat the tuned KNN model, so I'll estimate the final skill of KNN on the held-out validation set.

In [207]:
set.seed(13)
predictions <- predict(fit.knn, newdata=validation, na.action=na.pass)
confusionMatrix(predictions, validation$Class)


Confusion Matrix and Statistics

           Reference
Prediction  benign malignant
  benign        89         0
  malignant      2        48
                                         
               Accuracy : 0.9856         
                 95% CI : (0.949, 0.9983)
    No Information Rate : 0.6547         
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.9685         
 Mcnemar's Test P-Value : 0.4795         
                                         
            Sensitivity : 0.9780         
            Specificity : 1.0000         
         Pos Pred Value : 1.0000         
         Neg Pred Value : 0.9600         
             Prevalence : 0.6547         
         Detection Rate : 0.6403         
   Detection Prevalence : 0.6403         
      Balanced Accuracy : 0.9890         
                                         
       'Positive' Class : benign         
                                         
